When composing or reading Internet email, there are two types of encoding to be aware of: file encoding (also called transfer encoding) and character encoding.
This topic explains why these encoding types are needed when sending and receiving Internet email, and how to properly handle email that contains 8-bit data and international characters.
File encoding is needed for the following reason: many email messages contain 8-bit data, but the SMTP protocol may limit data transmission to 7-bit data only. As a result, any message containing 8-bit data must be file encoded into a 7-bit representation so that it can pass through such SMTP servers. Generally, this encoding is done using one of two algorithms: Quoted-Printable or Base64.
Quoted-Printable encoding works as follows: an 8-bit character has 256 possible values. The low 128 values are 7-bit ASCII values and pose no problem for transfer through an SMTP server. The high 128 values require 8 bits of storage and must be encoded before they can be transferred. Each such 8-bit value is converted to its hexadecimal representation and quoted, so decimal 200 becomes the ASCII string =C8. Because this scheme leaves ASCII characters intact, it is very efficient for encoding data that consists mostly of ASCII text.
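As an illustration, Python's standard quopri module implements this scheme (a minimal sketch; the sample string is hypothetical):

    import quopri

    text = "Total cost: 25¢"                    # contains one 8-bit Latin1 character (¢ = 162 = 0xA2)
    encoded = quopri.encodestring(text.encode("latin-1"))
    print(encoded)                              # the 0xA2 byte is emitted as =A2; plain ASCII passes through unchanged
    print(quopri.decodestring(encoded).decode("latin-1"))   # round-trips back to the original text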
The Base64 encoding scheme works in this way: the data is broken into groups of 24 bits. In the data's native form, each group represents three 8-bit bytes. For SMTP transmission, each group is re-divided into four 6-bit segments, and each 6-bit value is mapped to one of 64 printable ASCII characters.
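A minimal sketch of the same 3-bytes-to-4-characters conversion using Python's standard base64 module (the sample bytes are hypothetical):

    import base64

    data = bytes([200, 201, 202])             # three 8-bit bytes (24 bits)
    encoded = base64.b64encode(data)          # four printable ASCII characters
    print(encoded)                            # b'yMnK'
    print(base64.b64decode(encoded) == data)  # True -- decoding restores the original bytes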
When a message or message part is encoded by an email client using one of these two algorithms, a header (Content-Transfer-Encoding) is added to the message or message part, indicating to the receiving mail client which algorithm to use to properly decode the data.
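In Python's standard email package, for example, building a text part with a non-ASCII character set causes this header to be added automatically (a minimal sketch; the body text is hypothetical):

    from email.mime.text import MIMEText

    part = MIMEText("Résumé attached.", "plain", "utf-8")
    print(part["Content-Type"])               # text/plain; charset="utf-8"
    print(part["Content-Transfer-Encoding"])  # base64 -- the body was Base64 encoded for transport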
Once the message has been Base64 or Quoted-Printable decoded, it is still NOT properly interpreted. The message contents must now be interpreted using a character set.
Once a received email has been file decoded, it still may not be interpreted properly by the receiving mail client. The reason is that the mail client must take into account the character set the email was composed in to properly interpret the strings it contains. This process is known as character encoding/decoding.
When communicating with international mail clients, you should always be aware of character encoding issues. At the highest level, character encoding/decoding can be viewed as taking a visual string representation displayable in a UI, translating it into a sequence of byte values for transmission between systems, and then retranslating those numerical values back into the original string representation. This was once a simple procedure, but it has been made much more complex by the advent of the Internet and the international exchange of data.
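In Python terms, the two directions correspond to encoding a string to bytes and decoding those bytes back with the same character set (a minimal sketch; the sample text is hypothetical):

    text = "Straße"                      # visual string representation
    raw = text.encode("latin-1")         # translated to byte values for transmission
    print(raw)                           # b'Stra\xdfe'
    print(raw.decode("latin-1") == text) # True -- the same character set restores the original string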
In the early days of the SMTP protocol, the only string characters that English-speaking developers used were 7-bit ASCII values. For this reason, any time an email was received it was safe to assume that it was composed entirely of ASCII values for decoding purposes, and there were never any problems reading email.
Eventually, as demand required the use of more characters, the original 7-bit ASCII character set was extended into many 8-bit character sets, enabling developers to add an additional 128 characters to their character set definitions. For example, one of these was the Latin1 character set, which extended the ASCII characters to include values such as the copyright symbol (©), the registered trademark symbol (®), and other symbols useful to English-speaking people. Now English-speaking people could send email containing these non-ASCII symbols to other English-speaking people, and, with both clients assuming that the Latin1 character set was being used, all characters were interpreted and displayed properly.

Meanwhile, in other areas of the world, the original 7-bit ASCII character set was being extended in ways useful to non-English speakers. For example, the Russian alphabet was added on top of the original ASCII values to form the Cyrillic character set. Now a Russian-speaking person could send email to another Russian-speaking person using Russian characters, and, with both clients assuming that the Cyrillic character set was being used, all characters were interpreted and displayed correctly.
The problem became clear when international data transfer became more commonplace. An English-speaking person might use a non-ASCII Latin1 character such as the cent symbol (¢). For data transmission, this character would be represented by the sending mail client as the 8-bit value 162 (as specified by the Latin1 character set). A Russian mail application, reading this message, would wrongly interpret the 8-bit value 162 using the Cyrillic character set, resulting in the end user seeing the character Ў instead of ¢.
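This mismatch is easy to reproduce with Python's codec support (a minimal sketch; ISO-8859-5 is used here as a representative Cyrillic character set):

    cent = "¢"
    raw = cent.encode("latin-1")     # the cent symbol is byte value 162 (0xA2) in Latin1
    print(raw)                       # b'\xa2'
    print(raw.decode("iso8859-5"))   # Ў -- the same byte interpreted with a Cyrillic character set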
To enable the receiving mail client to properly interpret the characters, the charset parameter was added to the Content-Type mail header. The value of this parameter maps to a code page that the client can use to select the correct visual representation of a character for a given 8-bit value.
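A receiving client reads this parameter before decoding the body. With Python's standard email package, for example (a minimal sketch; the raw message below is hypothetical):

    from email import message_from_string

    raw = (
        "Content-Type: text/plain; charset=\"iso-8859-5\"\r\n"
        "Content-Transfer-Encoding: 8bit\r\n"
        "\r\n"
        "...body bytes to be interpreted with the charset above...\r\n"
    )
    msg = message_from_string(raw)
    print(msg.get_content_charset())  # iso-8859-5 -- the character set to use when decoding the body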
Special note about header encoding (or word encoding)
If a message header line contains 8-bit data it must use a special encoding syntax of the following format:
=?[charset]?[encoding]?[header content]?=
The =? and ?= delimiters at the beginning and end of the header indicate that the header was encoded. The value of [charset] indicates the character set to use to correctly interpret this header. The value of [encoding] indicates the encoding type to use (B for Base64, Q for Quoted-Printable) to correctly decode this header. The value of [header content] contains the encoded content. An example of a Base64-encoded Shift-JIS header line is shown below.
=?shift-jis?B?UmU6IA0KUmVsZWFzZSBkYXRlIHV0ayB2ZXIgMi4wID8=?=
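Python's standard email.header module can decode such encoded words (a minimal sketch using the example above):

    from email.header import decode_header

    encoded = "=?shift-jis?B?UmU6IA0KUmVsZWFzZSBkYXRlIHV0ayB2ZXIgMi4wID8=?="
    for raw, charset in decode_header(encoded):
        # raw holds the Base64-decoded bytes; charset names the character set to interpret them with
        print(charset, raw.decode(charset))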
Although the example above addresses most Internet email scenarios, it does not address all of them. Some languages contain many more characters than can possibly be encoded using 8 bits, such as many Asian languages. For languages such as these, a multi-byte representation must be used.
Note about Unicode
As the above examples illustrate, using an 8-bit representation is limiting. If a language has more than 256 characters, there is no way to represent all possible characters in a single byte. For that reason, the Unicode standard was developed: using a wide character representation (allowing up to 65,536 potential characters), it provides a numerical value for every known character.
The concept behind Unicode is that any known character can be represented as a sequence of one or more bytes (for example, using the UTF-8 encoding), and that byte sequence can always be converted back to the original Unicode character.
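A minimal sketch of this round trip using UTF-8 in Python (the sample string is hypothetical):

    text = "¢ and 日本"                  # characters from different scripts, all representable in Unicode
    raw = text.encode("utf-8")           # each character becomes a sequence of 1 or more bytes
    print(raw)                           # b'\xc2\xa2 and \xe6\x97\xa5\xe6\x9c\xac'
    print(raw.decode("utf-8") == text)   # True -- the byte sequence maps back to the same characters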
When do these encodings and decodings occur when using a PowerTCP product? As an example, take an email composed on a machine that uses Shift-JIS as the default character set and then Base64 encoded for SMTP transmission. When a PowerTCP product reads this data, it is parsed on the fly and Base64 decoded. After this process, you have an object representation of the message. As the parts of the message are accessed (for example, when reading Message.Text), the content is correctly interpreted using the specified character set.
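The same two-stage process (transfer decoding followed by character-set decoding) can be illustrated with Python's standard email package; this is only a sketch of the general flow, not of the PowerTCP API, and the raw message below is hypothetical:

    import base64
    from email import message_from_string

    body = base64.b64encode("改訂版を添付します。".encode("shift-jis")).decode("ascii")
    raw = (
        "Content-Type: text/plain; charset=\"shift-jis\"\r\n"
        "Content-Transfer-Encoding: base64\r\n"
        "\r\n" + body + "\r\n"
    )
    msg = message_from_string(raw)

    # Stage 1: Base64 decoding (the transfer encoding) recovers the original Shift-JIS bytes.
    decoded_bytes = msg.get_payload(decode=True)
    # Stage 2: the charset parameter tells the client how to interpret those bytes as text.
    print(decoded_bytes.decode(msg.get_content_charset()))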